Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?
Abstract
The article introduces the Long-Context Frontiers (LOFT) benchmark, a suite of six tasks spanning text, visual, and audio modalities to evaluate the performance of long-context language models (LCLMs) on real-world applications. The benchmark aims to push LCLMs to their limits and assess their potential to disrupt existing paradigms by eliminating the reliance on specialized tools and complex pipelines. The article presents the key insights from evaluating state-of-the-art LCLMs, including Gemini 1.5 Pro, GPT-4o, and Claude 3 Opus, on LOFT and comparing their performance to specialized models. It also introduces a novel prompting approach called Corpus-in-Context (CiC) Prompting, which enables LCLMs to directly ingest and process entire corpora within their context window.
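As a rough illustration of Corpus-in-Context Prompting, the sketch below assembles a single prompt that contains an instruction, every document in the corpus, a few worked examples, and the test query. The function name `build_cic_prompt`, the `ID: ... | TEXT: ...` layout, and the toy corpus are assumptions made for this example, not the exact format used in LOFT.

```python
from typing import Dict, List, Tuple


def build_cic_prompt(
    instruction: str,
    corpus: Dict[str, str],
    few_shot: List[Tuple[str, str]],
    query: str,
) -> str:
    """Assemble one prompt holding the instruction, the whole corpus,
    few-shot examples, and the test query (hypothetical layout)."""
    parts = [instruction, ""]
    # Give every document an explicit ID so the model can answer by citing it.
    for doc_id, text in corpus.items():
        parts.append(f"ID: {doc_id} | TEXT: {text}")
    parts.append("")
    # Few-shot examples map a query to the ID of its gold document.
    for question, gold_id in few_shot:
        parts.append(f"Query: {question}")
        parts.append(f"Answer: {gold_id}")
    # The test query comes last and is left open for the model to complete.
    parts.append(f"Query: {query}")
    parts.append("Answer:")
    return "\n".join(parts)


toy_corpus = {
    "0": "Paris is the capital of France.",
    "1": "Tokyo is the capital of Japan.",
    "2": "Ottawa is the capital of Canada.",
}
prompt = build_cic_prompt(
    instruction="Find the document that answers each query and reply with its ID.",
    corpus=toy_corpus,
    few_shot=[("What is the capital of Japan?", "1")],
    query="What is the capital of Canada?",
)
print(prompt)
```

Because the whole corpus sits in the prompt, no separate retriever or index is involved; the cost is that prompt length grows with corpus size.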
Q&A
[01] Text Retrieval
1. How does the performance of LCLMs compare to specialized retrieval models on text retrieval tasks? At the 128k token level, the largest context size comparable across all models, LCLMs rival the performance of Gecko, a leading textual retrieval system. Notably, Gemini 1.5 Pro also surpasses strong multi-modal retrieval models such as CLIP.
2. How does the positioning of the gold documents in the corpus affect the performance of LCLMs on text retrieval tasks? Performance drops as the gold documents of the test queries are moved towards the end of the corpus, suggesting reduced attention in later sections of the prompt. Conversely, placing the gold documents of the few-shot queries at the end improves recall, indicating that few-shot examples can mitigate attention weaknesses in this region.
3. What strategies can be used to overcome the performance degradation of LCLMs on longer context lengths? Co-locating gold documents of few-shot and test queries consistently boosts performance, as it gives the model information about where to look for the answer. This suggests that the model pays special attention to the locations where the gold documents for the few-shot examples are placed, regardless of where they are in the corpus.
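The positional experiments described above can be sketched as a reordering of document IDs before the corpus is written into the prompt. This is a minimal illustration under stated assumptions: `place_gold` and the toy IDs are hypothetical, and the fractions shown are not the paper's exact settings.

```python
from typing import List


def place_gold(doc_ids: List[str], gold_ids: List[str], fraction: float) -> List[str]:
    """Return a new ordering in which gold_ids sit contiguously at a
    relative position: 0.0 is the start of the corpus, 1.0 is the end."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0.0 and 1.0")
    gold = set(gold_ids)
    # Keep the remaining documents in their original order.
    others = [d for d in doc_ids if d not in gold]
    cut = round(fraction * len(others))
    return others[:cut] + list(gold_ids) + others[cut:]


doc_ids = [f"d{i}" for i in range(10)]
few_shot_gold = ["d2"]  # gold document of a few-shot query (toy value)
test_gold = ["d7"]      # gold document of the test query (toy value)

# Co-locate both gold documents at the end of the corpus.
co_located_end = place_gold(doc_ids, few_shot_gold + test_gold, 1.0)
# Place only the test gold document at the start for comparison.
test_at_start = place_gold(doc_ids, test_gold, 0.0)
print(co_located_end)
print(test_at_start)
```

In a real evaluation the position of the test gold document is unknown, so a sketch like this only serves to reproduce the ablation, not to improve a deployed system.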
[02] Visual Retrieval
1. How does the performance of LCLMs compare to specialized visual retrieval models? Gemini 1.5 Pro demonstrates comparable performance to CLIP, a widely used text-to-image retrieval model, across the visual retrieval datasets. This highlights the current capabilities of LCLMs in the visual retrieval domain.
2. What are the limitations of the current visual retrieval evaluation in LOFT? Two constraints limited the evaluation: suitable open-source image-to-text models were lacking, and Claude 3 Opus could not be evaluated on the visual retrieval task because its API supported only up to 20 images per request.
[03] Audio Retrieval
1. How does the performance of LCLMs compare to specialized audio retrieval models? Gemini 1.5 Pro demonstrates comparable performance to PaLM 2 DE, a state-of-the-art audio retrieval model, across all 5 languages in the FLEURS audio retrieval datasets. This suggests that LCLMs can effectively handle audio retrieval tasks without specialized fine-tuning.
2. How does the performance of LCLMs scale with increasing context lengths on audio retrieval tasks? The results confirm Gemini 1.5 Pro's robust performance across various context lengths, highlighting the current capabilities of LCLMs in audio retrieval. They also point to a need for more challenging audio datasets to further stress-test the limits of LCLMs.
[04] Retrieval-Augmented Generation (RAG)
1. How does the performance of LCLMs compare to specialized RAG pipelines on multi-hop reasoning tasks? Gemini 1.5 Pro, with the entire corpus in context, outperforms the RAG pipeline on multi-hop datasets (HotpotQA and MuSiQue). This is because LCLMs can reason over multiple passages in the context window using Chain-of-Thought, a capability that RAG pipelines typically lack.
2. What are the advantages of specialized retrieval models over LCLMs on multi-target reasoning tasks? A specialized retriever like Gecko excels at ranking all topically relevant passages from a corpus, enabling it to identify a comprehensive set of passages covering all answers. This proves particularly beneficial for multi-target datasets, such as QUEST and QAMPARI, where LCLMs lag behind.
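To make the multi-target point concrete, the sketch below computes a simple coverage score: the fraction of gold passages that appear among the top-k retrieved passages. The function `answer_coverage` and the passage IDs are hypothetical, and this measure is an illustrative stand-in rather than the metric reported for LOFT.

```python
from typing import List


def answer_coverage(retrieved_ids: List[str], gold_ids: List[str], k: int) -> float:
    """Fraction of gold passages found within the top-k retrieved passages."""
    gold = set(gold_ids)
    if not gold:
        raise ValueError("gold_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & gold) / len(gold)


# Toy ranking for a query with three gold passages.
ranking = ["p3", "p1", "p9", "p4"]
gold_passages = ["p1", "p4", "p7"]

coverage_at_2 = answer_coverage(ranking, gold_passages, k=2)
coverage_at_4 = answer_coverage(ranking, gold_passages, k=4)
print(coverage_at_2, coverage_at_4)
```

A query with many correct answers is only fully answered when coverage reaches 1.0, which is why a ranker that surfaces every relevant passage has an edge on datasets such as QUEST and QAMPARI.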
[05] SQL-like Compositional Reasoning
1. How does the performance of LCLMs compare to specialized SQL pipelines on SQL-like reasoning tasks? LCLMs achieve reasonable performance on the SQL-like reasoning tasks, though they are significantly behind the specialized DAIL-SQL pipeline. This reveals substantial headroom to enhance the compositional reasoning capabilities of LCLMs.
2. What types of SQL operations are particularly challenging for LCLMs? The analysis shows that averaging is the most difficult operation for LCLMs, while counting is relatively easy. Moreover, reasoning over equality is considerably easier than reasoning over inequality.
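For readers unfamiliar with the operation types mentioned above, the following sketch uses Python's built-in `sqlite3` module to compute reference answers for counting, averaging, equality, and inequality questions over a toy table. The `singer` table and its rows are invented for illustration; in the benchmark setting the model would have to derive the same answers from the table text in its context rather than by executing SQL.

```python
import sqlite3

# Build a small in-memory table (toy data, not from the benchmark).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, country TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singer VALUES (?, ?, ?)",
    [("Ana", "France", 30), ("Ben", "France", 40), ("Cy", "Japan", 50), ("Di", "Japan", 20)],
)


def scalar(sql: str):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]


# Counting with an equality filter.
count_equal = scalar("SELECT COUNT(*) FROM singer WHERE country = 'France'")
# Counting with an inequality filter.
count_greater = scalar("SELECT COUNT(*) FROM singer WHERE age > 25")
# Averaging requires combining every matching value, not just tallying rows.
average_age = scalar("SELECT AVG(age) FROM singer")

print(count_equal, count_greater, average_age)
```

Averaging asks the model to sum and divide over all matching rows, and inequality filters require comparing each value against a threshold, which gives some intuition for why these cases are harder than counting exact matches.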
[06] Many-Shot In-Context Learning (ICL)
1. How does the performance of different LCLMs compare on the many-shot ICL tasks? Gemini 1.5 Pro outperforms GPT-4o on most of the many-shot ICL benchmarks, except for the BBH-tracking7 dataset, where Gemini performs surprisingly poorly. On average, Claude 3 Opus achieves the best performance among the LCLMs on this task.
2. How does increasing the number of examples in the prompt affect the performance of LCLMs on many-shot ICL tasks? Performance improves monotonically with more examples on knowledge-intensive tasks like LIB-dialog. However, on reasoning-intensive tasks like BBH-tracking7 and BBH-web, scaling the number of in-context examples does not lead to consistent improvements, suggesting that models reach a limit sooner on how much they can learn from additional examples.
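The scaling experiment described above amounts to building prompts with progressively more labeled examples. The sketch below shows one way to do that; `build_many_shot_prompt`, the `Input:`/`Label:` layout, and the toy examples are assumptions for illustration, not the benchmark's prompt format.

```python
from typing import List, Tuple


def build_many_shot_prompt(
    examples: List[Tuple[str, str]], query: str, n_shots: int
) -> str:
    """Build a prompt from the first n_shots labeled examples plus the query."""
    if not 0 <= n_shots <= len(examples):
        raise ValueError("n_shots must be between 0 and len(examples)")
    blocks = [f"Input: {text}\nLabel: {label}" for text, label in examples[:n_shots]]
    # The query is left unlabeled for the model to complete.
    blocks.append(f"Input: {query}\nLabel:")
    return "\n\n".join(blocks)


toy_examples = [
    ("great film", "positive"),
    ("dull plot", "negative"),
    ("loved it", "positive"),
    ("waste of time", "negative"),
]

# Sweep the number of shots to see how prompt size grows with examples.
prompts = {n: build_many_shot_prompt(toy_examples, "quite charming", n) for n in (1, 2, 4)}
for n, text in prompts.items():
    print(n, len(text))
```

Evaluating a model on each prompt in such a sweep and plotting accuracy against the number of shots would show whether a task keeps benefiting from more examples or plateaus early.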